Linear Regression

Association between two measurements. As scientists, we often try to decipher the relationship between phenomena. Is thing 1 associated with thing 2? Or, ultimately, does thing 1 influence thing 2? Does thing 1 cause thing 2?

Iris Dataset

Anderson: catalog natural variation in Iris species.

Fisher: figure out how to classify species from the relationships among these measurements.

By Frank Mayfield - originally posted to Flickr as Iris virginica shrevei BLUE FLAG, CC BY-SA 2.0, https://commons.wikimedia.org/w/index.php?curid=9805580

Setosa attribution: CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=170298

# Fisher's Iris data
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
cor(iris$Petal.Length,iris$Petal.Width)
## [1] 0.9628654
# Correlation is high, but there is more than one group.
# Fisher was trying to build a classifier
#with(iris, plot(Petal.Length, Petal.Width, col=Species, main="iris dataset"))
library(ggplot2) # needed for ggplot()
gp = ggplot(iris, 
           aes(Petal.Length, Petal.Width, col=Species)) + 
  geom_point() + 
  ggtitle("iris dataset") 

print(gp) # plots the previous commands

levels(iris$Species)
## [1] "setosa"     "versicolor" "virginica"
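The comment above notes that the correlation is high but there is more than one group. A quick sketch checking the correlation within each species (base R, no extra packages):

```r
# correlation of petal length vs. petal width within each species
by_species = sapply(split(iris, iris$Species),
                    function(d) cor(d$Petal.Length, d$Petal.Width))
round(by_species, 2)
```

The within-species correlations are noticeably lower than the pooled 0.96, a first hint of the group-structure issues explored below.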
iris.mod = lm(Petal.Width ~ Petal.Length, data=iris)

# add the regression line to the plot
gp + geom_abline(intercept = iris.mod$coefficients['(Intercept)'], 
                 slope = iris.mod$coefficients['Petal.Length'])

summary(iris.mod)
## 
## Call:
## lm(formula = Petal.Width ~ Petal.Length, data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.56515 -0.12358 -0.01898  0.13288  0.64272 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.363076   0.039762  -9.131  4.7e-16 ***
## Petal.Length  0.415755   0.009582  43.387  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2065 on 148 degrees of freedom
## Multiple R-squared:  0.9271, Adjusted R-squared:  0.9266 
## F-statistic:  1882 on 1 and 148 DF,  p-value: < 2.2e-16

The last line reports the F-statistic (an ANOVA for the whole regression). You can perform that test directly on the fitted model via anova(iris.mod); fitting with aov(Petal.Width ~ Petal.Length, data=iris) gives a slightly different interface to the same analysis.
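A minimal sketch of running that test directly on the model (output elided):

```r
# fit the same model as above and request its ANOVA table;
# the F value matches the last line of summary(iris.mod)
iris.mod = lm(Petal.Width ~ Petal.Length, data=iris)
anova(iris.mod)
```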

  • The relationship may not be directly relevant to Fisher's and Anderson's original goals, but…
  • is this a good dataset for linear regression?
  • Does the fit accurately describe the relationship between the variables, given the separation in groups?

Keep these questions in mind.

Build up linear regression:

  • from a straight line
  • adding noise
  • adding outliers
  • adding groups

How does linear regression perform on these types of data, and what do we do when we encounter them?

Let’s reverse engineer the process of regression a little, and then start shaking it up.

The mechanics of linear regression

We’ll start by plotting points on a perfect line, then add variation to those points.

# equation for a line is: y = mx + b
m = .5 # choose a slope
b = 3
# plot points on a line
N = 20 # number of points
x = 1:N
y = m * x + b
plot(x, y, main="Points along a line")

# The fit returns the parameters m and b
perfect = lm(y ~ x)
perfect
## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##         3.0          0.5
# summary(perfect)
# In summary.lm(perfect) : essentially perfect fit: summary may be unreliable
# add the fit line to the plot
plot(x, y, main="Perfect Line with fit")
abline(perfect) # "perfect" is the variable returned by "lm(y ~ x)"

# linear regression equation is: y = B0 + B1x + e
# the values of y deviate from the line by the error term e

set.seed(0) # try a different number for different randoms
y = b + m * x + rnorm(N, mean=0, sd=1) # the error is normally distributed, mean = 0, standard deviation = 1
plot(x, y, 
     xlim=c(0, max(x)), 
     ylim=c(0, max(y)),
     pch=20,
     main="Plot with normally distributed error added")
abline(h=0) # horizontal axis
abline(v=0) # vertical axis
model = lm(y ~ x)
abline(model) # add the fitted line to the plot
segments(x, y, x, predict(model), col="red") # add residuals to plot
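The red segments drawn by segments() are exactly the model residuals; a small sketch (same slope, intercept, and seed as above) verifying that resid() equals the vertical distance y - predict():

```r
set.seed(0)
x = 1:20
y = 3 + 0.5 * x + rnorm(20) # line plus standard normal noise
model = lm(y ~ x)
# residuals are the vertical distances from each point to the fitted line
stopifnot(isTRUE(all.equal(resid(model), y - predict(model))))
```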

# Plot: A line with (more) variation
set.seed(0)
y = b + m * x + rnorm(N, mean=0, sd=3) # notice the increase in standard deviation
plot(x, y, 
     xlim=c(0, max(x)), 
     ylim=c(min(0,y), max(y)),
     pch=20,
     main="straight line, more error")
# axes
abline(h=0) # horiz
abline(v=0) # vert
# fit the line
model.wider = lm(y ~ x)
abline(model.wider)
segments(x, y, x, predict(model.wider), col="red")

summary(model)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.81859 -0.65877 -0.06829  0.70369  2.37527 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.65252    0.45466   8.034 2.31e-07 ***
## x            0.43769    0.03795  11.532 9.55e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9788 on 18 degrees of freedom
## Multiple R-squared:  0.8808, Adjusted R-squared:  0.8742 
## F-statistic:   133 on 1 and 18 DF,  p-value: 9.55e-10
summary(model.wider)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4558 -1.9763 -0.2049  2.1111  7.1258 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   4.9576     1.3640   3.635   0.0019 **
## x             0.3131     0.1139   2.749   0.0132 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.936 on 18 degrees of freedom
## Multiple R-squared:  0.2958, Adjusted R-squared:  0.2566 
## F-statistic: 7.559 on 1 and 18 DF,  p-value: 0.01319
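The drop in fit quality can be pulled out programmatically rather than read off the printed summaries; a sketch rebuilding both models with the same seeds used above:

```r
set.seed(0)
x = 1:20
y1 = 3 + 0.5 * x + rnorm(20, mean=0, sd=1) # tight noise
set.seed(0)
y3 = 3 + 0.5 * x + rnorm(20, mean=0, sd=3) # wide noise, same draws scaled
model       = lm(y1 ~ x)
model.wider = lm(y3 ~ x)
# R-squared falls sharply as the noise standard deviation grows
c(sd1 = summary(model)$r.squared, sd3 = summary(model.wider)$r.squared)
```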

How do outliers affect the fit?

Let’s add an outlier.

### Use a tighter spread.
set.seed(0)
y = b + m * x + rnorm(N, mean=0, sd=2)

### model2: x2, y2. Add a single outlier.
x2 = c(x, 1.5)
y2 = c(y, 12)
model2 = lm(y2 ~ x2)
library(broom) # provides augment()
mod.diag = augment(model2) # 'augment' calculates regression diagnostics
head(mod.diag)
## # A tibble: 6 x 8
##      y2    x2 .fitted .resid   .hat .sigma .cooksd .std.resid
##   <dbl> <dbl>   <dbl>  <dbl>  <dbl>  <dbl>   <dbl>      <dbl>
## 1  6.03     1    5.77  0.259 0.159    2.50 0.00127      0.116
## 2  3.35     2    6.06 -2.71  0.135    2.40 0.113       -1.20 
## 3  7.16     3    6.35  0.806 0.115    2.49 0.00807      0.352
## 4  7.54     4    6.65  0.899 0.0973   2.49 0.00815      0.389
## 5  6.33     5    6.94 -0.610 0.0823   2.49 0.00307     -0.262
## 6  2.92     6    7.23 -4.31  0.0700   2.27 0.127       -1.84
cooksd = mod.diag$.cooksd
sum(cooksd > 4/N) # this counts the number satisfying "cooksd > 4/N"
## [1] 1
# Plot: We see the outlier in the upper left shift the line slightly.
plot(x2, y2, 
     xlim=c(0, max(x2)), 
     ylim=c(min(0,y2), max(y2)),
     pch=ifelse(cooksd > 4/N, 4,20), # 'X' for the outliers (pch=4), pch=20 otherwise
     main="Outlier added (x)",
     sub="dashed: model w/o outlier"
     ) 

abline(h=0); abline(v=0)
abline(model, lty=2)
abline(model2)
segments(x2, y2, x2, predict(model2), col="red")
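broom's augment() is convenient, but Cook's distance is also available in base R via cooks.distance(), so this diagnostic needs no extra package. A sketch on the same model2 data (same seed and values as above):

```r
set.seed(0)
x = 1:20
y = 3 + 0.5 * x + rnorm(20, mean=0, sd=2)
x2 = c(x, 1.5); y2 = c(y, 12) # append the single outlier
model2 = lm(y2 ~ x2)
cd = cooks.distance(model2) # same values as augment(model2)$.cooksd
sum(cd > 4/20)              # flags the single added outlier
```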

### model3: x3,y3. Add another outlier on the lower right
x3 = c(x2, 20.5)
y3 = c(y2, 0)
model3 = lm(y3 ~ x3)
mod.diag = augment(model3) # 'augment' is from package 'broom'
cooksd = mod.diag$.cooksd
sum(cooksd > 4/N)  
## [1] 2
cooksd[cooksd > 4/N]
## [1] 0.2516166 0.9620788
# Plot: We see that this new outlier pulls the slope drastically downward.
plot(x3, y3,
     xlim=c(0, max(x3)), 
     ylim=c(min(0,y3), max(y3)),
     pch=ifelse(cooksd > 4/N, 4,20), # 'X' for the outliers (pch=4), pch=20 otherwise
     main="Second outlier added", 
     sub="dashed: model w/o outliers")
abline(h=0) # x-axis
abline(v=0) # y-axis
abline(model, lty=2) # original fitted line (dashed, lty=2), before outliers
abline(model3) # line with outliers added, (solid)
segments(x3, y3, x3, predict(model3), col="red") # residuals

When the outlier drives the entire association

set.seed(0)
# model4: x4, y4. Random cloud + extreme outlier.
# a random cloud centered at 10,5
x4 = rnorm(N, 10) 
y4 = rnorm(N, 5)
# single outlier in lower right
x4 = c(x4, 20)
y4 = c(y4, 0)
model4 = lm(y4 ~ x4)

# regression diagnostics
mod.diag = augment(model4)
cooksd = mod.diag$.cooksd
sum(cooksd > 4/N)  
## [1] 1
cooksd[cooksd > 4/N]
## [1] 16.98731
summary(model4)
## 
## Call:
## lm(formula = y4 ~ x4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3515 -0.8630  0.0499  0.6371  1.3540 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.20107    0.88769  10.365 2.94e-09 ***
## x4          -0.41366    0.08271  -5.001 7.92e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8873 on 19 degrees of freedom
## Multiple R-squared:  0.5683, Adjusted R-squared:  0.5456 
## F-statistic: 25.01 on 1 and 19 DF,  p-value: 7.924e-05
plot(x4,y4,
     pch=ifelse(cooksd > 4/N, 4,1), # 'X' for the outliers (pch=4), pch=1 otherwise
     main="Random cloud with outlier added",
     sub="dashed: model w/o outliers")
     
abline(model4)

# model4.censored: a new model ignoring the outliers (via Cook's d)
df = data.frame(x=x4,y=y4)
model4.censored = lm(y ~ x, data=df[cooksd < 4/N,])
abline(model4.censored, lty=2) # dashed line: model with outliers removed.

If your data have outliers, run the regression without them and see whether the association still holds, or changes.
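That advice can be wrapped in a small helper (the function name here is hypothetical; it uses the same 4/n Cook's distance rule of thumb as above):

```r
# refit a model after dropping points with Cook's distance above 4/n
refit_without_influential = function(formula, data) {
  full = lm(formula, data=data)
  cd = cooks.distance(full)
  keep = cd <= 4 / nrow(data)
  list(full = full, censored = lm(formula, data=data[keep, ]))
}

# same random cloud + extreme outlier as model4 above
set.seed(0)
df = data.frame(x = c(rnorm(20, 10), 20), y = c(rnorm(20, 5), 0))
fits = refit_without_influential(y ~ x, df)
# the apparent slope largely disappears once the outlier is dropped
c(full = coef(fits$full)["x"], censored = coef(fits$censored)["x"])
```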

Are they outliers if they lie together?

# model5: x5,y5. More outliers in the lower right.
# add more
x5 = c(x4, 18.5)
y5 = c(y4, 0.8)
x5 = c(x5, 19.5)
y5 = c(y5, 2)
model5 = lm(y5 ~ x5)


mod.diag = augment(model5)
cooksd = mod.diag$.cooksd
sum(cooksd > 4/N)  
## [1] 2
cooksd[cooksd > 4/N]
## [1] 0.5417482 0.2878865
summary(model5)
## 
## Call:
## lm(formula = y5 ~ x5)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.34814 -0.84870  0.05036  0.72489  1.35432 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.12865    0.65251  13.990 4.09e-12 ***
## x5          -0.40675    0.05583  -7.285 3.57e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8798 on 21 degrees of freedom
## Multiple R-squared:  0.7165, Adjusted R-squared:  0.703 
## F-statistic: 53.08 on 1 and 21 DF,  p-value: 3.569e-07
plot(x5,y5,
     pch=ifelse(cooksd > 4/N, 4,1),  # 'X' for the outliers (pch=4), pch=1 otherwise
     main="Outliers close together", 
     sub="dashed: model w/o outliers")
abline(model5)

df = data.frame(x=x5,y=y5)
model5.censored = lm(y ~ x, data=df[cooksd < 4/N,])

abline(model5.censored, lty=2)

summary(model4.censored)
## 
## Call:
## lm(formula = y ~ x, data = df[cooksd < 4/N, ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3960 -0.4850 -0.1938  0.6252  1.1028 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  4.56682    1.65739   2.755    0.013 *
## x            0.05449    0.16495   0.330    0.745  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7345 on 18 degrees of freedom
## Multiple R-squared:  0.006026,   Adjusted R-squared:  -0.04919 
## F-statistic: 0.1091 on 1 and 18 DF,  p-value: 0.7449
summary(model5.censored)
## 
## Call:
## lm(formula = y ~ x, data = df[cooksd < 4/N, ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.34693 -0.82997  0.04098  0.67133  1.34442 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.87576    0.98601   9.002 2.78e-08 ***
## x           -0.38156    0.09298  -4.104 0.000605 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8756 on 19 degrees of freedom
## Multiple R-squared:  0.4699, Adjusted R-squared:  0.4419 
## F-statistic: 16.84 on 1 and 19 DF,  p-value: 0.0006048

Motivating example

ChIP-seq for chromatin marks. Value is relative signal in a mutant background versus wild type, log scale.

Biological Question: Does a certain gene affect activating/repressing chromatin states?

Experimental approach: mutate that gene, then measure the two different chromatin states (via histone marks) in mutant and wild-type worms.

The measurement is done genome-wide by an experiment called ChIP-seq, which uses antibodies specific to a histone mark to capture the DNA carrying that mark. The captured regions are then identified by next-generation sequencing, and the signal intensity reflects how heavily the histones are modified and how often each region is captured in the isolated sample.

What types of genes are targeted for the histone modifications facilitated by the gene of interest?

Statistical approach: Is there a negative correlation between the histone mark intensities in certain types of genes, versus others?

Original data

# ChIP-seq for chromatin marks. Value is relative signal in a mutant background versus wildtype, log scale.

# intensity of repressive mark for a set of genes (each value = gene)
h3k9me2 = c(-2.25, -1.75, -1.83, -1.17, 0.04, -1.37, -0.97, -2.06, -0.39, -0.40, -1.34, -1.36, -1.32, -1.42, -0.98, -1.12, -1.06, -1.32, -1.32, -0.06, -0.17, -0.03, -0.22, -0.03, -0.15, -0.05, -0.23, 0.06, -0.12, -0.05, 0.20, 0.10, -0.27, -0.39, -0.33, -0.39, -0.46, -0.55, -0.53, 0.26, 0.35, 0.20, 0.11, 0.06, 0.05, -0.32, -0.30, -0.62)

# intensity of activating mark for a set of genes (each value = gene)
h3k4me3 = c(2.25, 1.97, 0.95, 0.52, -0.07, 0.56, 0.76, 1.31, 0.52, 0.87, 0.44, 0.89, 0.96, 0.20, 0.10, -0.08, 1.81, 1.55, 1.50, -0.08, -0.06, -0.08, -0.11, -0.01, -0.09, -0.07, -0.07, -0.07, -0.02, 0.06, -0.10, -0.13, -0.14, -0.15, -0.04, 0.01, 0.06, 0.01, -0.09, -0.25, -0.25, 0.02, 0.07, 0.22, 0.17, 0.12, 0.27, 0.20)

par(pty='s') # square plot region; 'pty' must be set via par(), not plot()
plot(h3k9me2, 
     h3k4me3, 
     xlim=c(-3,3), 
     ylim=c(-3,3),
     xlab="Repressing Signal (H3K9me2, mutant / WT)",
     ylab="Activating Signal (H3K4me3, mutant / WT)",
     main="HTA-germline genes")
abline(h=0)
abline(v=0)
text(2, -2, sprintf("r = %.3f", cor(h3k9me2,h3k4me3)))

df = data.frame(h3k9me2, h3k4me3)
lm.1 = lm(h3k4me3 ~ h3k9me2, data=df)
lm.1.diag = augment(lm.1)
cooksd = lm.1.diag$.cooksd
n = nrow(df) # 48 genes here, not the N=20 from the simulations above

par(pty='s')
plot(h3k9me2, 
     h3k4me3, 
     xlim=c(-3,3), 
     ylim=c(-3,3),
     xlab="Repressing Signal (H3K9me2, mutant / WT)",
     ylab="Activating Signal (H3K4me3, mutant / WT)",
     pch=ifelse(cooksd > 4/n, 4,1),
     main="HTA-germline genes (original model plus outliers removed)",
     sub="dashed: outliers removed")
abline(h=0)
abline(v=0)

# original 
abline(lm.1)

# with outliers removed
lm.1.censored = lm(h3k4me3 ~ h3k9me2, data=df[cooksd < 4/n,])
abline(lm.1.censored, lty=2)

Question: Is there a better way to distinguish the correlation between histone marks for genes affected by the mutant?

Regression on mixed populations

Searching around for regression diagnostics, trying to describe the problem in Google, I eventually turned up Simpson's paradox.

Simpson’s paradox (Yule-Simpson effect): A trend within groups deviates, or even contradicts, a trend seen when the groups are aggregated.

The paradox can be resolved when confounding variables and causal relations are appropriately addressed in the statistical modeling (Wikipedia).
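A tiny simulated example of the effect, with two groups that each trend downward but trend upward when pooled (all numbers here are invented for illustration):

```r
set.seed(1)
# within each group, y decreases with x; group 2 is shifted up and right
x1 = rnorm(50);     y1 = -0.5 * x1 + rnorm(50, sd=0.5)
x2 = rnorm(50) + 5; y2 = -0.5 * (x2 - 5) + rnorm(50, sd=0.5) + 5
cor(x1, y1)               # negative within group 1
cor(x2, y2)               # negative within group 2
cor(c(x1, x2), c(y1, y2)) # positive when the groups are pooled
```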

Illustration

By Pace~svwiki - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=62007681

Simpsons package in R

Clusters may be provided, or estimated with mixture modeling. The package then fits a regression within each cluster and compares the slope ("beta") estimates against the regression on the pooled data.
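The mixture modeling is done with the mclust package (visible as the 'Mclust' object in the summary output below). Assuming the h3k9me2 and h3k4me3 vectors from earlier, the two-cluster grouping can be reproduced directly along these lines:

```r
library(mclust) # assumed installed; model-based clustering via Mclust()
fit = Mclust(cbind(h3k9me2, h3k4me3), G = 2) # force two clusters
table(fit$classification) # cluster sizes
```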

Significance is assessed with a permutation test (‘permutationtest’):

  1. A regression is considered significantly different from the group if the difference in beta estimate exceeds the lower or upper 2.5 percent of the permuted null distribution. If this is the case, a warning is issued as follows: “Warning: Beta regression estimate in cluster X is significantly different compared to the group!”.
  2. If the sign of the regression within a cluster is different (positive or negative) than the sign for the group and the beta estimate deviates significantly, a warning states “Sign reversal: Simpson’s Paradox! Cluster X is significantly different and in the opposite direction compared to the group!”
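The simp.histone object analyzed below is never defined in this excerpt; it was presumably created with a call along these lines (argument order is an assumption based on the Simpsons package's interface, with h3k9me2 as X to match the cluster output; df is the data frame built earlier):

```r
library(Simpsons) # assumed installed; detects Simpson's paradox via clustering
simp.histone = Simpsons(h3k9me2, h3k4me3, data=df)
```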
ggplot(simp.histone$alldata, aes(X,Y,color=factor(clusterid))) + 
  geom_point() + 
  geom_abline(
    slope=simp.histone$Allbeta[1], 
    intercept = simp.histone$Allint[1], 
    color="#F8766D") + 
  geom_abline(
    slope=simp.histone$groupbeta, 
    intercept = simp.histone$groupint, 
    color="grey", size = 2, alpha = .5) +
  geom_abline(
    slope=simp.histone$Allbeta[2], 
    intercept = simp.histone$Allint[2], 
    color="#00BFC4")  # you can find the 1st 2 ggplot colors with hue_pal()(2)

library(dplyr) # for %>%, group_by, summarize
correlation_by_group = simp.histone$alldata %>% 
  group_by(clusterid) %>% 
  summarize(cor(X,Y))
## `summarise()` ungrouping output (override with `.groups` argument)
print(correlation_by_group)
## # A tibble: 2 x 2
##   clusterid `cor(X, Y)`
##       <dbl>       <dbl>
## 1         1      -0.495
## 2         2      -0.318

The intrinsic clustering did a good job of separating the groups.

The outcome: cluster 1 has a slope of -0.6798802, not as steep as the overall -0.7421426, whereas the second cluster is much flatter at -0.1589689.

The within-cluster correlation coefficients show that the separation between the groups drove the high overall correlation.

coef(simp.histone)
##            N         Int       Beta         Pval
## Alldata   48 -0.07975478 -0.7421426 1.371252e-11
## Cluster 1 18  0.06391150 -0.6798802 3.664584e-02
## Cluster 2 30 -0.04268135 -0.1589689 8.719577e-02
summary(simp.histone)
## $regressionResults
##            N         Int       Beta         Pval
## Alldata   48 -0.07975478 -0.7421426 1.371252e-11
## Cluster 1 18  0.06391150 -0.6798802 3.664584e-02
## Cluster 2 30 -0.04268135 -0.1589689 8.719577e-02
## 
## $mclustResults
## 'Mclust' model object: (VEV,2) 
## 
## Available components: 
##  [1] "call"           "data"           "modelName"      "n"             
##  [5] "d"              "G"              "BIC"            "loglik"        
##  [9] "df"             "bic"            "icl"            "hypvol"        
## [13] "parameters"     "z"              "classification" "uncertainty"
cluster(simp.histone)
## $`Cluster 1`
##        X     Y clusterid
## 1  -2.25  2.25         1
## 2  -1.75  1.97         1
## 3  -1.83  0.95         1
## 4  -1.17  0.52         1
## 6  -1.37  0.56         1
## 7  -0.97  0.76         1
## 8  -2.06  1.31         1
## 9  -0.39  0.52         1
## 10 -0.40  0.87         1
## 11 -1.34  0.44         1
## 12 -1.36  0.89         1
## 13 -1.32  0.96         1
## 14 -1.42  0.20         1
## 15 -0.98  0.10         1
## 16 -1.12 -0.08         1
## 17 -1.06  1.81         1
## 18 -1.32  1.55         1
## 19 -1.32  1.50         1
## 
## $`Cluster 2`
##        X     Y clusterid
## 5   0.04 -0.07         2
## 20 -0.06 -0.08         2
## 21 -0.17 -0.06         2
## 22 -0.03 -0.08         2
## 23 -0.22 -0.11         2
## 24 -0.03 -0.01         2
## 25 -0.15 -0.09         2
## 26 -0.05 -0.07         2
## 27 -0.23 -0.07         2
## 28  0.06 -0.07         2
## 29 -0.12 -0.02         2
## 30 -0.05  0.06         2
## 31  0.20 -0.10         2
## 32  0.10 -0.13         2
## 33 -0.27 -0.14         2
## 34 -0.39 -0.15         2
## 35 -0.33 -0.04         2
## 36 -0.39  0.01         2
## 37 -0.46  0.06         2
## 38 -0.55  0.01         2
## 39 -0.53 -0.09         2
## 40  0.26 -0.25         2
## 41  0.35 -0.25         2
## 42  0.20  0.02         2
## 43  0.11  0.07         2
## 44  0.06  0.22         2
## 45  0.05  0.17         2
## 46 -0.32  0.12         2
## 47 -0.30  0.27         2
## 48 -0.62  0.20         2
plot(simp.histone)

It’s not the full-fledged paradox, because the signs aren’t reversed, but the groups in the data have different associations, or some have an association where others don’t.